2025-05-08-12-04
The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete
Abstract
arXiv:2505.03961v1 Announce Type: new Abstract: According to Yuval Noah Harari, large-scale human cooperation is driven by shared narratives that encode common beliefs and values. This study explores whether such narratives can similarly nudge LLM agents toward collaboration. We use a finitely repeated public goods game in which LLM agents choose either cooperative or egoistic spending strategies. We prime agents with stories highlighting teamwork to different degrees and test how this influences negotiation outcomes. Our experiments explore four questions: (1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What happens when the number of agents grows? (4) Are agents resilient against self-serving negotiators? We find that story-based priming significantly affects negotiation strategies and success rates. Common stories improve collaboration, benefiting each agent. By contrast, priming agents with different stories reverses this effect, and those agents primed toward self-interest prevail. We hypothesize that these results carry implications for multi-agent system design and AI alignment.
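Summary
Yuval Noah Harari argues that large-scale human cooperation is driven by shared narratives that encode common beliefs and values. This study examines whether such narratives can similarly promote collaboration among LLM agents. We use a finitely repeated public goods game in which LLM agents choose between cooperative and egoistic spending strategies. By priming agents with stories that emphasize teamwork to different degrees, we test how this intervention affects negotiation outcomes. The experiments address four core questions: (1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What changes as the number of agents grows? (4) Can agents withstand self-serving negotiators? We find that story-based priming significantly affects negotiation strategies and success rates. Shared narratives raise the level of collaboration and benefit every agent, whereas priming agents with different stories has the opposite effect, and agents primed toward self-interest then gain the advantage. We hypothesize that these findings have implications for multi-agent system design and AI alignment research.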
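The payoff structure of a single round of a public goods game can be sketched as follows. This is the textbook linear formulation, not the paper's exact setup: the endowment, the multiplier, and the function name `public_goods_payoffs` are assumptions chosen for illustration.

```python
from typing import List


def public_goods_payoffs(contributions: List[float],
                         endowment: float,
                         multiplier: float) -> List[float]:
    """Payoffs for one round of a linear public goods game.

    Each agent keeps what it does not contribute; the pooled
    contributions are multiplied and split equally among all agents.
    All parameter values used below are hypothetical.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Four agents, each endowed with 10 tokens, pot multiplied by 2.
all_cooperate = public_goods_payoffs([10, 10, 10, 10], endowment=10, multiplier=2.0)
one_free_rider = public_goods_payoffs([10, 10, 10, 0], endowment=10, multiplier=2.0)
```

With these assumed values, universal cooperation beats universal defection, yet a lone free rider earns more than the cooperators around it, which is the tension that story-based priming is meant to influence.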
Frog Soup: Zero-Shot, In-Context, and Sample-Efficient Frogger Agents
Abstract
arXiv:2505.03947v1 Announce Type: new Abstract: One of the primary aspirations in reinforcement learning research is developing general-purpose agents capable of rapidly adapting to and mastering novel tasks. While RL gaming agents have mastered many Atari games, they remain slow and costly to train for each game. In this work, we demonstrate that the latest reasoning LLMs with out-of-domain RL post-training can play a challenging Atari game called Frogger in a zero-shot setting. We then investigate the effect of in-context learning and the amount of reasoning effort on LLM performance. Lastly, we demonstrate a way to bootstrap traditional RL methods with LLM demonstrations, which significantly improves their performance and sample efficiency. Our implementation is open sourced at https://github.com/AlienKevin/frogger.
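Summary
One of the main goals of reinforcement learning research is to develop general-purpose agents that can quickly adapt to and master new tasks. Although existing RL gaming agents have mastered many Atari games, training them for each new game remains slow and costly. This work shows that the latest reasoning LLMs with out-of-domain RL post-training can play Frogger, a challenging Atari game, in a zero-shot setting. We further examine how in-context learning and the amount of reasoning effort affect LLM performance. Finally, we present a technique that uses LLM demonstrations to bootstrap traditional RL methods, which significantly improves their performance and sample efficiency. The code is open sourced at https://github.com/AlienKevin/frogger.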
MARCO: A Multi-Agent System for Optimizing HPC Code Generation Using Large Language Models
Abstract
arXiv:2505.03906v1 Announce Type: new Abstract: Large language models (LLMs) have transformed software development through code generation capabilities, yet their effectiveness for high-performance computing (HPC) remains limited. HPC code requires specialized optimizations for parallelism, memory efficiency, and architecture-specific considerations that general-purpose LLMs often overlook. We present MARCO (Multi-Agent Reactive Code Optimizer), a novel framework that enhances LLM-generated code for HPC through a specialized multi-agent architecture. MARCO employs separate agents for code generation and performance evaluation, connected by a feedback loop that progressively refines optimizations. A key innovation is MARCO's web-search component that retrieves real-time optimization techniques from recent conference proceedings and research publications, bridging the knowledge gap in pre-trained LLMs. Our extensive evaluation on the LeetCode 75 problem set demonstrates that MARCO achieves a 14.6% average runtime reduction compared to Claude 3.5 Sonnet alone, while the integration of the web-search component yields a 30.9% performance improvement over the base MARCO system. These results highlight the potential of multi-agent systems to address the specialized requirements of high-performance code generation, offering a cost-effective alternative to domain-specific model fine-tuning.
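Summary
Large language models (LLMs) have changed software development through their code generation capabilities, but their usefulness for high-performance computing (HPC) remains limited. HPC code needs specialized treatment of parallelism, memory efficiency, and architecture-specific optimization, which general-purpose LLMs often overlook. We present MARCO (Multi-Agent Reactive Code Optimizer), a new framework that improves LLM-generated HPC code through a specialized multi-agent architecture. MARCO uses separate agents for code generation and performance evaluation, with a feedback loop that refines optimizations step by step. Its core innovation is a web-search component that retrieves up-to-date optimization techniques from recent conference proceedings and research literature, closing the knowledge gap of pre-trained LLMs. A comprehensive evaluation on the LeetCode 75 problem set shows that MARCO reduces average runtime by 14.6% compared with Claude 3.5 Sonnet alone, and that integrating the web-search component improves performance by a further 30.9% over the base MARCO system. These results demonstrate the potential of multi-agent systems to meet the special requirements of high-performance code generation and offer a cost-effective alternative to domain-specific model fine-tuning.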
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
Abstract
arXiv:2505.04021v1 Announce Type: new Abstract: Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and challenges for this task. The long-tail popularity of models and their long idle periods present opportunities to improve utilization through GPU sharing. However, existing GPU sharing systems lack the ability to adjust their resource allocation and sharing policies at runtime, making them ineffective at meeting latency service-level objectives (SLOs) under rapidly fluctuating workloads. This paper presents Prism, a multi-LLM serving system that unleashes the full potential of GPU sharing to achieve both cost efficiency and SLO attainment. At its core, Prism tackles a key limitation of existing systems: the lack of cross-model memory coordination, which is essential for flexibly sharing GPU memory across models under dynamic workloads. Prism achieves this with two key designs. First, it supports on-demand memory allocation by dynamically mapping physical to virtual memory pages, allowing flexible memory redistribution among models that space- and time-share a GPU. Second, it improves memory efficiency through a two-level scheduling policy that dynamically adjusts sharing strategies based on models' runtime demands. Evaluations on real-world traces show that Prism achieves more than 2x cost savings and 3.3x SLO attainment compared to state-of-the-art systems.
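Summary
Serving large language models (LLMs) is expensive, especially for providers that host many models, so reducing cost is essential. The distinctive workload patterns of multi-LLM serving bring new opportunities and challenges for this task. The long-tail popularity of models and their long idle periods create room to improve utilization through GPU sharing. Existing GPU sharing systems, however, cannot adjust resource allocation and sharing policies at runtime, which makes it hard for them to meet latency service-level objectives (SLOs) under rapidly fluctuating workloads. This paper presents Prism, a system that fully exploits GPU sharing to achieve both cost efficiency and SLO attainment. At its core, Prism addresses a key shortcoming of existing systems: the lack of cross-model memory coordination, which is essential for flexibly sharing GPU memory among models under dynamic load. Prism achieves this through two key designs. First, it supports on-demand memory allocation by dynamically mapping physical to virtual memory pages, enabling flexible memory redistribution among models that space- and time-share a GPU. Second, it improves memory efficiency with a two-level scheduling policy that dynamically adjusts sharing strategies according to the models' runtime demands. Tests on real-world traces show that, compared with state-of-the-art systems, Prism achieves more than 2x cost savings and a 3.3x improvement in SLO attainment.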
LogiDebrief: A Signal-Temporal Logic based Automated Debriefing Approach with Large Language Models Integration
Abstract
arXiv:2505.03985v1 Announce Type: new Abstract: Emergency response services are critical to public safety, with 9-1-1 call-takers playing a key role in ensuring timely and effective emergency operations. To ensure call-taking performance consistency, quality assurance is implemented to evaluate and refine call-takers' skillsets. However, traditional human-led evaluations struggle with high call volumes, leading to low coverage and delayed assessments. We introduce LogiDebrief, an AI-driven framework that automates traditional 9-1-1 call debriefing by integrating Signal-Temporal Logic (STL) with Large Language Models (LLMs) for fully-covered rigorous performance evaluation. LogiDebrief formalizes call-taking requirements as logical specifications, enabling systematic assessment of 9-1-1 calls against procedural guidelines. It employs a three-step verification process: (1) contextual understanding to identify responder types, incident classifications, and critical conditions; (2) STL-based runtime checking with LLM integration to ensure compliance; and (3) automated aggregation of results into quality assurance reports. Beyond its technical contributions, LogiDebrief has demonstrated real-world impact. Successfully deployed at Metro Nashville Department of Emergency Communications, it has assisted in debriefing 1,701 real-world calls, saving 311.85 hours of active engagement. Empirical evaluation with real-world data confirms its accuracy, while a case study and extensive user study highlight its effectiveness in enhancing call-taking performance.
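Summary
Emergency response services are critical to public safety, and 9-1-1 call-takers play a key role in ensuring that emergency operations are timely and effective. To keep call-taking consistent, quality assurance is used to evaluate and continually refine call-takers' skills. Traditional human-led evaluation, however, struggles with high call volumes, which leads to low coverage and delayed assessments. This paper presents LogiDebrief, a framework that combines Signal-Temporal Logic (STL) with large language models (LLMs) to automate 9-1-1 call debriefing and deliver rigorous performance evaluation with full coverage. The framework turns call-taking requirements into logical specifications, supporting systematic assessment of calls against procedural guidelines. Its three-step verification process consists of: (1) contextual understanding to identify responder types, incident classifications, and critical conditions; (2) STL-based runtime checking with LLM integration to ensure procedural compliance; and (3) automated generation of quality assurance reports. Beyond its technical contributions, the framework has been successfully deployed at the Metro Nashville Department of Emergency Communications, where it has debriefed 1,701 real-world calls and saved 311.85 hours of manual work. An empirical study confirms the accuracy of its evaluations, while a case study and a large-scale user study confirm its effectiveness in improving call-taking performance.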
QStore: Quantization-Aware Compressed Model Storage
Abstract
arXiv:2505.04081v1 Announce Type: new Abstract: Modern applications commonly leverage large, multi-modal foundation models. These applications often feature complex workflows that demand the storage and usage of similar models in multiple precisions. A straightforward approach is to maintain a separate file for each model precision (e.g., INT8, BF16), which is indeed the approach taken by many model providers such as HuggingFace and Ollama. However, this approach incurs excessive storage costs since a higher precision model (e.g., BF16) is a strict superset of a lower precision model (e.g., INT8) in terms of information. Unfortunately, simply maintaining only the higher-precision model and requiring every user to dynamically convert the model precision is not desirable because every user of lower precision models must pay the cost for model download and precision conversion. In this paper, we present QStore, a unified, lossless compression format for simultaneously storing a model in two (high and low) precisions efficiently. Instead of storing low-precision and high-precision models separately, QStore stores the low-precision model and only the residual information needed to reconstruct the high-precision model. The residual information is significantly smaller than the original high-precision model, thus achieving high savings in storage cost. Moreover, QStore does not compromise the speed of model loading. The low-precision models can be loaded quickly just like before. The high-precision models can also be reconstructed efficiently in memory by merging low-precision data and the residual with QStore's lightweight decoding logic. We evaluate QStore for compressing multiple precisions of popular foundation models, and show that QStore reduces overall storage footprint by up to 2.2x (45% of the original size) while enabling up to 1.7x and 1.8x faster model saving and loading versus existing approaches.
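Summary
Modern applications commonly rely on large, multi-modal foundation models. These applications often involve complex workflows that require storing and using similar models in several precisions. The usual solution is to keep a separate file for each model precision (e.g., INT8, BF16), which is the approach taken by model providers such as HuggingFace and Ollama. This leads to excessive storage costs, however, because a higher-precision model (e.g., BF16) fully contains the information in a lower-precision model (e.g., INT8). Keeping only the higher-precision model and asking users to convert precision on the fly is not viable either, because every user of the lower-precision model would have to bear the cost of downloading the model and converting it.
This paper presents QStore, a unified, lossless compression format for efficiently storing a model in both high and low precision. Instead of storing the two models separately, QStore stores the low-precision model together with the residual information needed to reconstruct the high-precision model. The residual is much smaller than the original high-precision model, which significantly lowers storage cost. In addition, QStore does not slow down model loading: the low-precision model loads as quickly as before, and the high-precision model can be reconstructed efficiently in memory by merging the low-precision data with the residual using QStore's lightweight decoding logic. We test QStore on multi-precision compression of popular foundation models and find that it reduces storage footprint by up to 2.2x (45% of the original size), while model saving and loading are up to 1.7x and 1.8x faster than with existing approaches.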
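The idea of keeping a low-precision model plus a residual that losslessly restores the high-precision one can be illustrated with a toy encoding. This is not QStore's actual format, which the abstract does not specify; the XOR-of-bit-patterns residual, the 16-bit float stand-in for BF16, and the names `encode` and `decode_high` are all assumptions made for this sketch.

```python
import struct


def f16_to_bits(x: float) -> int:
    """Bit pattern of x rounded to an IEEE 754 half-precision float."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def bits_to_f16(b: int) -> float:
    """Half-precision float with the given 16-bit pattern."""
    return struct.unpack("<e", struct.pack("<H", b))[0]


def encode(weights_f16, scale):
    """Return (INT8 codes, residual bit patterns) for 16-bit float weights."""
    low = [max(-128, min(127, round(w / scale))) for w in weights_f16]
    # Residual: XOR between the true bits and the bits of the dequantized value.
    residual = [f16_to_bits(w) ^ f16_to_bits(q * scale)
                for w, q in zip(weights_f16, low)]
    return low, residual


def decode_high(low, residual, scale):
    """Rebuild the exact 16-bit weights from INT8 codes plus residual."""
    return [bits_to_f16(f16_to_bits(q * scale) ^ r)
            for q, r in zip(low, residual)]


# Hypothetical weights, first rounded so they are exactly representable.
weights = [bits_to_f16(f16_to_bits(w)) for w in [0.5, -0.25, 0.1234, 1.0]]
scale = 1.0 / 127
low, residual = encode(weights, scale)
restored = decode_high(low, residual, scale)
```

In this toy scheme the INT8 codes are usable on their own, and because the dequantized value is usually close to the original, the residual tends to have zero high-order bits, which is what would make it cheaper to store than a second full copy.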
Can Large Language Models Predict Parallel Code Performance?
Abstract
arXiv:2505.03988v1 Announce Type: new Abstract: Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a roofline classification task: given the source code of a GPU kernel and the hardware specifications of a target GPU, can an LLM predict whether the GPU kernel is compute-bound or bandwidth-bound? For this study, we build a balanced dataset of 340 GPU kernels, obtained from HeCBench benchmark and written in CUDA and OpenMP, along with their ground-truth labels obtained via empirical GPU profiling. We evaluate LLMs across four scenarios: (1) with access to profiling data of the kernel source, (2) zero-shot with source code only, (3) few-shot with code and label pairs, and (4) fine-tuned on a small custom dataset. Our results show that state-of-the-art LLMs have a strong understanding of the Roofline model, achieving 100% classification accuracy when provided with explicit profiling data. We also find that reasoning-capable LLMs significantly outperform standard LLMs in zero- and few-shot settings, achieving up to 64% accuracy on GPU source codes, without profiling information. Lastly, we find that LLM fine-tuning will require much more data than what we currently have available. This work is among the first to use LLMs for source-level roofline performance prediction via classification, and illustrates their potential to guide optimization efforts when runtime profiling is infeasible. Our findings suggest that with better datasets and prompt strategies, LLMs could become practical tools for HPC performance analysis and performance portability.
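Summary
Accurately assessing the performance of parallel GPU code usually requires execution-time profiling on the target hardware, a step that is increasingly difficult because access to high-end GPUs is limited. This paper examines whether large language models (LLMs) can offer an alternative for GPU performance prediction that does not depend on hardware. We frame the problem as a roofline classification task: given the source code of a GPU kernel and the hardware specifications of a target GPU, can an LLM predict whether the kernel is compute-bound or bandwidth-bound?
We build a balanced dataset of 340 GPU kernels taken from the HeCBench benchmark and written in CUDA and OpenMP, with ground-truth labels obtained through empirical GPU profiling. We evaluate LLMs in four scenarios: (1) with profiling data for the kernel source; (2) zero-shot with source code only; (3) few-shot with code and label pairs; and (4) fine-tuned on a small custom dataset.
The results show that state-of-the-art LLMs have a strong understanding of the roofline model, reaching 100% classification accuracy when given explicit profiling data. We also find that reasoning-capable LLMs significantly outperform standard LLMs in zero-shot and few-shot settings, reaching up to 64% accuracy on GPU source code without profiling information. Finally, we find that LLM fine-tuning would require far more data than is currently available.
This work is among the first to use LLMs for source-level roofline performance prediction via classification, and it shows their potential to guide optimization when runtime profiling is infeasible. The findings suggest that, with better datasets and prompting strategies, LLMs could become practical tools for HPC performance analysis and performance portability.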
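The classification target can be made concrete with the standard roofline rule: a kernel is compute-bound when its arithmetic intensity (FLOPs per byte moved) is at or above the hardware ridge point (peak FLOP/s divided by peak bandwidth). The sketch below applies that rule to profiling-style inputs; the function name and the hardware figures are hypothetical and are not taken from the paper.

```python
def classify_kernel(flops: float, bytes_moved: float,
                    peak_flops: float, peak_bandwidth: float) -> str:
    """Label a kernel using the roofline model.

    flops, bytes_moved: work and memory traffic of the kernel.
    peak_flops (FLOP/s), peak_bandwidth (bytes/s): target GPU limits.
    """
    arithmetic_intensity = flops / bytes_moved
    ridge_point = peak_flops / peak_bandwidth
    if arithmetic_intensity >= ridge_point:
        return "compute-bound"
    return "bandwidth-bound"


# Hypothetical GPU: 10 TFLOP/s peak compute, 1 TB/s peak bandwidth.
PEAK_FLOPS = 10e12
PEAK_BANDWIDTH = 1e12

# A streaming kernel doing 2 FLOPs per 12 bytes of traffic.
streaming = classify_kernel(2, 12, PEAK_FLOPS, PEAK_BANDWIDTH)
# A dense kernel doing 1000 FLOPs per 10 bytes of traffic.
dense = classify_kernel(1000, 10, PEAK_FLOPS, PEAK_BANDWIDTH)
```

When profiling data supplies the FLOP and byte counts, the label follows directly from this comparison; the harder case studied above is inferring it from source code alone.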
TrajEvo: Designing Trajectory Prediction Heuristics via LLM-driven Evolution
Abstract
arXiv:2505.04480v1 Announce Type: new Abstract: Trajectory prediction is a crucial task in modeling human behavior, especially in fields such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy, while recently proposed deep learning approaches suffer from computational cost, lack of explainability, and generalization issues that limit their practical adoption. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We introduce Cross-Generation Elite Sampling to promote population diversity and a Statistics Feedback Loop that allows the LLM to analyze alternative predictions. Our evaluations show TrajEvo outperforms previous heuristic methods on the ETH-UCY datasets, and remarkably outperforms both heuristics and deep learning methods when generalizing to the unseen SDD dataset. TrajEvo represents a first step toward automated design of fast, explainable, and generalizable trajectory prediction heuristics. We make our source code publicly available to foster future research at https://github.com/ai4co/trajevo.
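Summary
Trajectory prediction is a key task in modeling human behavior, especially in fields such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy, while recently proposed deep learning approaches suffer from high computational cost, limited explainability, and restricted generalization, which hinders their practical use. This paper presents TrajEvo, a framework that uses large language models (LLMs) to design trajectory prediction heuristics automatically. The framework applies an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We propose a Cross-Generation Elite Sampling strategy to increase population diversity and a Statistics Feedback Loop that lets the LLM analyze alternative predictions. Evaluations show that TrajEvo outperforms existing heuristic methods on the ETH-UCY datasets and, when transferred to the unseen SDD dataset, clearly outperforms both heuristic and deep learning methods. TrajEvo is a first step toward the automated design of fast, explainable, and generalizable trajectory prediction heuristics. The source code is publicly available to support future research: https://github.com/ai4co/trajevo.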
Benchmarking LLMs' Swarm intelligence
Abstract
arXiv:2505.04364v1 Announce Type: new Abstract: Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict constraints, such as the limited local perception and communication characteristic of natural swarms, remains largely unexplored, particularly concerning the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (k x k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs in a zero-shot setting, we find significant performance variations across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit built upon a customizable and scalable physical system with defined mechanical properties. It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is available at https://github.com/x66ccff/swarmbench.
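Summary
Large language models (LLMs) show potential for complex reasoning, but their capacity for emergent coordination in multi-agent systems (MAS) under strict constraints, such as the limited local perception and communication typical of natural swarms, remains largely unexplored, particularly with respect to the finer aspects of swarm intelligence. Existing benchmarks often fail to capture the distinctive challenges of decentralized coordination that arise when agents act on incomplete spatio-temporal information. We therefore introduce SwarmBench, a new benchmark designed to systematically evaluate the swarm intelligence of LLMs acting as decentralized agents. SwarmBench contains five foundational MAS coordination tasks in a configurable 2D grid environment and forces agents to rely mainly on local sensory input (a k x k view) and local communication. We propose metrics for coordination effectiveness and analyze the group dynamics that emerge. A zero-shot evaluation of several leading LLMs reveals significant performance differences across tasks, highlighting the difficulty imposed by local information constraints. Although some coordination is observed, the results indicate that robust planning and strategy formation under uncertainty remain limited in these decentralized scenarios. Evaluating LLMs under swarm-like conditions is essential for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit built on a customizable physical system with well-defined mechanical properties. It provides environments, prompts, evaluation scripts, and the complete experimental datasets, with the aim of supporting reproducible research on LLM-based MAS coordination and the theoretical foundations of embodied MAS. The code repository is available at https://github.com/x66ccff/swarmbench.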
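The local-perception constraint can be illustrated with a small helper that extracts the k x k window an agent would see around its own cell. This is a generic sketch, not SwarmBench code: the grid encoding, the `fill` symbol for cells beyond the border, and the function name `local_view` are assumptions.

```python
def local_view(grid, row, col, k, fill="#"):
    """Return the k x k window of `grid` centered on (row, col).

    Cells that fall outside the grid are reported as `fill`, so the
    agent always receives a window of the same shape.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the agent sits at the center")
    half = k // 2
    view = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else fill)
        view.append(line)
    return view


# Hypothetical 3 x 3 world with two agents, A and B.
world = [list("A.."), list("..."), list("..B")]
corner_view = local_view(world, 0, 0, 3)
center_view = local_view(world, 1, 1, 3)
```

From the corner, agent A's window contains no trace of B, which is the kind of partial information that decentralized agents have to plan around.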
Promoting Security and Trust on Social Networks: Explainable Cyberbullying Detection Using Large Language Models in a Stream-Based Machine Learning Framework
Abstract
arXiv:2505.03746v1 Announce Type: cross Abstract: Social media platforms enable instant and ubiquitous connectivity and are essential to social interaction and communication in our technological society. Alongside their advantages, these platforms have given rise to negative behaviors in the online community, including so-called cyberbullying. Despite the many recent works involving generative Artificial Intelligence (AI) in the literature, there remain opportunities to study its performance beyond zero/few-shot learning strategies. Accordingly, we propose an innovative, real-time solution for cyberbullying detection that leverages stream-based Machine Learning (ML) models, which process incoming samples incrementally, and Large Language Models (LLMs) for feature engineering to address the evolving nature of abusive and hate speech online. An explainability dashboard is provided to promote the system's trustworthiness, reliability, and accountability. Results on experimental data report promising performance, close to 90% in all evaluation metrics and surpassing the results obtained by competing works in the literature. Ultimately, our proposal contributes to the safety of online communities by detecting abusive behavior in a timely manner to prevent long-lasting harassment and reduce the negative consequences in society.
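Summary
Social media platforms provide instant and ubiquitous connectivity and are essential to social interaction and communication in our technological society. Despite their many advantages, these platforms have also given rise to negative behaviors in online communities, known as cyberbullying. Although many recent works in the literature involve generative Artificial Intelligence (AI), its performance beyond zero-shot and few-shot learning strategies remains to be explored. We therefore propose an innovative real-time solution for cyberbullying detection that uses stream-based Machine Learning (ML) models, which can process incoming samples incrementally, together with Large Language Models (LLMs) for feature engineering, in order to cope with the evolving nature of abusive and hate speech online. We also provide an explainability dashboard to improve the system's trustworthiness, reliability, and accountability. Results on experimental data show strong performance, with all evaluation metrics close to 90% and exceeding the results of comparable works in the literature. Ultimately, our proposal contributes to the safety of online communities by detecting abusive behavior in time to prevent long-lasting harassment and reduce negative consequences for society.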
APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design
Abstract
arXiv:2505.03748v1 Announce Type: cross Abstract: DNN accelerators, significantly advanced by model compression and specialized dataflow techniques, have made considerable progress. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, which may account for 69% of power consumption. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. A grouping strategy that combines APSQ with PSUM quantization enhanced by a reconfigurable architecture is further proposed. APSQ performs nearly losslessly on NLP and CV tasks across BERT, Segformer, and EfficientViT models while compressing PSUMs to INT8. This leads to a notable reduction in energy costs of 28-87%. Extended experiments on LLaMA2-7B demonstrate the potential of APSQ for large language models. Code is available at https://github.com/Yonghao-Tan/APSQ.
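Summary
Deep neural network (DNN) accelerators have made significant progress, driven by model compression and specialized dataflow techniques. However, in architectures that use input/weight stationary dataflows, frequent access to high-precision partial sums (PSUMs) leads to excessive memory demands. Traditional compression strategies usually overlook PSUM quantization, even though it may account for 69% of power consumption. This study proposes a novel Additive Partial Sum Quantization (APSQ) method that integrates PSUM accumulation seamlessly into the quantization framework. It further proposes a grouping strategy that combines APSQ with PSUM quantization enhanced by a reconfigurable architecture. Experiments show that APSQ compresses PSUMs to INT8 with almost no loss on natural language processing and computer vision tasks across BERT, Segformer, and EfficientViT models, while reducing energy costs by 28-87%. Extended experiments on LLaMA2-7B confirm the potential of APSQ for large language models. The code is open sourced at https://github.com/Yonghao-Tan/APSQ.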
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Abstract
arXiv:2505.03756v1 Announce Type: cross Abstract: Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance metrics such as Time-To-First-Token (TTFT), neglecting usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system to optimize the serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference with a unified caching pool. The cache swapper determines the swap-in or swap-out of LoRAs and KV caches based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that ELORA reduces the TTFT by 63.4% on average, compared to state-of-the-art works.
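Summary
Multiple Low-Rank Adapters (Multi-LoRAs) are increasingly popular in task-specific large language model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high bandwidth memory (HBM) of accelerators can improve inference performance. However, existing multi-LoRA inference systems fail to optimize serving performance metrics such as Time-To-First-Token (TTFT), because they ignore usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a multi-LoRA caching system that optimizes serving performance. FASTLIBRA consists of a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference through a unified caching pool. The cache swapper uses a unified cost model to decide when to swap LoRAs and KV caches in or out, when the HBM is idle or busy, respectively. Experimental results show that, compared with state-of-the-art works, ELORA reduces TTFT by 63.4% on average.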
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Abstract
arXiv:2505.03745v1 Announce Type: cross Abstract: Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on resource-constrained edge devices poses significant challenges, including (1) intensive computations and huge model sizes, (2) great memory and bandwidth demands introduced by the autoregressive generation process, and (3) limited scalability for handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference through algorithm and hardware co-design. At the algorithmic level, we integrate (1) pruning, (2) Λ-shaped attention, and (3) an innovative W2A8KV4 (2-bit weights, 8-bit activations, and 4-bit KV cache) quantization scheme, thus effectively reducing memory and bandwidth requirements while facilitating LLMs' long-sequence generation. At the hardware level, we design a dedicated FPGA-based accelerator with a reconfigurable computing engine to effectively and flexibly accommodate diverse operations arising from our compression algorithm, thereby fully translating the algorithmic innovations into tangible hardware efficiency. We validate AccLLM on the Xilinx Alveo U280 FPGA, demonstrating a 4.07x energy efficiency and a 2.98x throughput compared to the state-of-the-art work FlightLLM.
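Summary
In recent years, large language models (LLMs) have achieved great success in natural language processing (NLP), creating strong demand to extend their deployment from the cloud to edge devices. Deploying LLMs on resource-constrained edge devices, however, poses major challenges, including (1) intensive computation and very large model sizes, (2) the high memory and bandwidth demands of autoregressive generation, and (3) limited scalability when handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that achieves efficient and fast long-context LLM inference through algorithm and hardware co-design. At the algorithmic level, we combine (1) pruning, (2) Λ-shaped attention, and (3) an innovative W2A8KV4 quantization scheme (2-bit weights, 8-bit activations, and 4-bit KV cache), which reduces memory and bandwidth requirements while improving the long-sequence generation capability of LLMs. At the hardware level, we design a dedicated FPGA-based accelerator with a reconfigurable computing engine that efficiently and flexibly handles the diverse operations produced by the compression algorithm, turning the algorithmic innovations into real hardware efficiency. We validate AccLLM on the Xilinx Alveo U280 FPGA, where it achieves 4.07x the energy efficiency and 2.98x the throughput of the state-of-the-art work FlightLLM.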
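To show what different bit widths mean for weights, activations, and the KV cache, the sketch below applies plain symmetric uniform quantization at 2, 4, and 8 bits and compares the reconstruction error. It is a generic textbook quantizer, not the W2A8KV4 scheme itself, whose details the abstract does not give; the sample values and function names are hypothetical.

```python
def quantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers.

    Returns the integer codes and the scale needed to dequantize.
    """
    qmax = 2 ** (bits - 1) - 1                          # e.g. 1 for 2-bit, 127 for 8-bit
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # guard against all-zero input
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def max_error(values, bits):
    """Largest absolute reconstruction error at the given bit width."""
    codes, scale = quantize(values, bits)
    return max(abs(v - d) for v, d in zip(values, dequantize(codes, scale)))


# Hypothetical tensor values.
sample = [0.9, -0.4, 0.05, -1.0]
error_2bit = max_error(sample, 2)   # weight-like precision
error_4bit = max_error(sample, 4)   # KV-cache-like precision
error_8bit = max_error(sample, 8)   # activation-like precision
```

The error shrinks as the bit width grows, which is why a mixed scheme can spend more bits where accuracy is most sensitive and fewer where memory dominates.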
GPU Performance Portability needs Autotuning
Abstract
arXiv:2505.03780v1 Announce Type: cross Abstract: As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art performance LLM execution without code changes. Focusing on flash attention -- a widespread performance-critical LLM kernel -- we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
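Summary
As large language models (LLMs) grow in complexity, reaching state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises the barrier to entry for new AI hardware. This work proposes combining just-in-time (JIT) compilation with kernel parameter autotuning to achieve portable, state-of-the-art LLM execution without code changes. Using flash attention, a widely used performance-critical LLM kernel, as the example, we show that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code along several dimensions, and even outperforms vendor-optimized implementations by up to 230%, while reducing kernel code size by 70x and eliminating manual code optimization. The results indicate that autotuning is a promising path to model portability across GPU vendors.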
Splitwiser: Efficient LM inference with constrained resources
Abstract
arXiv:2505.03763v1 Announce Type: cross Abstract: Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface and vLLM. We open-source our code for the respective implementations: 1) Huggingface (https://github.com/asad-aali/splitwiser), and 2) vLLM (https://github.com/adney11/vllm-sysml).
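Summary
Efficient inference for large language models (LLMs) remains a key challenge. It consists of two main phases: compute-intensive prompt computation and memory-intensive token generation. Despite existing batching and scheduling techniques, the token generation phase still underuses compute resources, especially in comparison with the prompt computation phase. To address this, we propose Splitwiser, a method that splits the two phases of an LLM inference request and runs them on the same GPU, reducing overhead and improving memory access and cache utilization. By removing the need to transfer data across devices, Splitwiser aims to minimize network-related overhead. This report describes the basic structure of the proposed pipeline and shares preliminary results and analysis. We implement the multiprocessing design on two widely used and independent LLM architectures, Huggingface and vLLM, and open-source the code for both: 1) Huggingface (https://github.com/asad-aali/splitwiser), and 2) vLLM (https://github.com/adney11/vllm-sysml).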
Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding
Abstract
arXiv:2505.03788v1 Announce Type: cross Abstract: We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs). Existing state-of-the-art UQ methods rely on consistency among multiple responses generated by the LLM on an input query under diverse settings. However, these approaches often report higher confidence in scenarios where the LLM is consistently incorrect. This leads to a poorly calibrated confidence with respect to accuracy. To address this, we leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models. Specifically, we ground the textual responses to the visual inputs. The confidence from the grounding model is used to calibrate the overall confidence. Given that using a grounding model adds its own uncertainty in the pipeline, we apply temperature scaling - a widely accepted parametric calibration technique - to calibrate the grounding model's confidence in the accuracy of generated responses. We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA. The experiments demonstrate that the proposed framework achieves significantly improved calibration on both tasks.
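Summary
We propose a new approach to calibrating uncertainty quantification (UQ) for multi-modal large language models (LLMs). Existing state-of-the-art UQ methods rely on the consistency of multiple responses that the LLM generates for an input query under different settings. These methods, however, tend to report higher confidence in cases where the LLM is consistently wrong, so confidence is poorly calibrated with respect to accuracy. To address this, we add cross-modal consistency to self-consistency in order to improve the calibration of multi-modal models. Specifically, we ground the textual responses in the visual inputs and use the confidence of the grounding model to calibrate the overall confidence. Because the grounding model introduces its own uncertainty into the pipeline, we apply temperature scaling, a widely accepted parametric calibration technique, to calibrate the grounding model's confidence in the accuracy of the generated responses. We evaluate the approach on several multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), with multi-modal models including LLaVA-Med and LLaVA. The experiments show that the framework achieves significantly better calibration on both tasks.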
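Temperature scaling, the calibration technique named above, divides a model's logits by a single scalar T fitted on held-out data so that confidence better matches accuracy. The sketch below fits T by grid search over negative log-likelihood; it is a generic illustration on made-up logits, not the paper's grounding pipeline, and all names and values are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[label])
              for row, label in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Hypothetical validation set: the model is about 98% confident every time
# but is right only three times out of four.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]
candidates = [0.5 + 0.5 * k for k in range(20)]   # 0.5, 1.0, ..., 10.0

best_t = fit_temperature(val_logits, val_labels, candidates)
raw_confidence = softmax(val_logits[0])[0]
calibrated_confidence = softmax(val_logits[0], best_t)[0]
```

For an overconfident model the fitted temperature comes out above 1, which softens the probabilities without changing which class is predicted.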
Large Language Model Compression with Global Rank and Sparsity Optimization
Abstract
arXiv:2505.03801v1 Announce Type: cross Abstract: Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of existing methods. The first challenge relates to the interaction and cooperation between low-rank and sparse matrices, while the second involves determining weight allocation across different layers, as redundancy varies considerably among them. To address these challenges, we propose a novel two-stage LLM compression method with the capability of global rank and sparsity optimization. It is noteworthy that the overall optimization space is vast, making comprehensive optimization computationally prohibitive. Therefore, to reduce the optimization space, our first stage utilizes robust principal component analysis to decompose the weight matrices of LLMs into low-rank and sparse components, which span the low dimensional and sparse spaces containing the resultant low-rank and sparse matrices, respectively. In the second stage, we propose a probabilistic global optimization technique to jointly identify the low-rank and sparse structures within the above two spaces. The appealing feature of our approach is its ability to automatically detect the redundancy across different layers and to manage the interaction between the sparse and low-rank components. Extensive experimental results indicate that our method significantly surpasses state-of-the-art techniques for sparsification and composite approximation.
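Summary
Low-rank and sparse composite approximation is a natural idea for compressing large language models (LLMs). However, it faces two core challenges that seriously affect the performance of existing techniques. The first concerns the interaction and cooperation between the low-rank and sparse matrices; the second concerns how to allocate weights across layers, since redundancy differs considerably among them. To address these challenges, we propose a new two-stage LLM compression method capable of global rank and sparsity optimization. The overall optimization space is very large, which makes comprehensive optimization computationally prohibitive. The first stage therefore uses robust principal component analysis to decompose the LLM weight matrices into low-rank and sparse components, which span the low-dimensional and sparse spaces containing the resulting matrices. The second stage introduces a probabilistic global optimization technique that jointly identifies the low-rank and sparse structures within these two spaces. The main strength of the method is that it automatically detects redundancy across layers and manages the interaction between the sparse and low-rank components. Extensive experiments show that the method significantly outperforms state-of-the-art techniques for sparsification and composite approximation.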
LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection
Abstract
arXiv:2505.03793v1 Announce Type: cross Abstract: The proliferation of open-sourced Large Language Models (LLMs) and diverse downstream tasks necessitates efficient model selection, given the impracticality of fine-tuning all candidates due to computational constraints. Despite the recent advances in LLM selection, a fundamental research question remains largely unexplored: how can we model the dynamic behaviors of LLMs during fine-tuning, thereby enhancing our understanding of their generalization performance across diverse downstream tasks? In this work, we propose a novel theoretical framework that provides a proper lens to assess the generalization capabilities of LLMs, thereby enabling accurate and efficient LLM selection for downstream applications. In particular, we first derive a Hessian-based PAC-Bayes generalization bound that unveils fine-tuning dynamics of LLMs and then introduce LENSLLM, a Neural Tangent Kernel (NTK)-based Rectified Scaling Model that enables accurate performance predictions across diverse tasks while maintaining computational efficiency. Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces up to 88.5% computational cost in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed LENSLLM model and corresponding results at the GitHub link: https://github.com/Susan571/LENSLLM.git.
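Summary
With the proliferation of open-source large language models (LLMs) and the growing diversity of downstream tasks, efficient model selection has become essential, because computational constraints make it impractical to fine-tune every candidate. Despite recent progress in LLM selection, a core research question is still at an early stage: how can we model the dynamic behavior of LLMs during fine-tuning and thereby deepen our understanding of their generalization performance across downstream tasks? This work proposes a novel theoretical framework that offers an effective lens for assessing the generalization capabilities of LLMs, enabling accurate and efficient LLM selection for downstream applications. Specifically, we first derive a Hessian-based PAC-Bayes generalization bound that reveals the fine-tuning dynamics of LLMs. We then introduce LENSLLM, a Neural Tangent Kernel (NTK)-based Rectified Scaling Model that predicts performance across tasks accurately while remaining computationally efficient. Experiments on 3 large-scale benchmarks show that our model reaches up to 91.1% accuracy in LLM selection and reduces computational cost by up to 88.5%, outperforming 5 state-of-the-art methods. The LENSLLM model and related results are open sourced on GitHub: https://github.com/Susan571/LENSLLM.git.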
Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth
Abstract
arXiv:2505.03802v1 Announce Type: cross Abstract: QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLMs). Recently, methods based on SVD for continuous update iterations to initialize LoRA matrices to accommodate quantization errors have generally failed to consistently improve performance. Dynamic mixed precision is a natural idea for continuously improving the fine-tuning performance of quantized models, but previous methods often optimize low-rank subspaces or quantization components separately, without considering their synergy. To address this, we propose QR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search the quantization components and the rank of low-rank spaces for each layer, thereby continuously improving model performance. QR-Adaptor does not minimize quantization error but treats precision and rank allocation as a discrete optimization problem guided by actual downstream performance and memory usage. Compared to state-of-the-art (SOTA) quantized LoRA fine-tuning methods, our approach achieves a 4.89% accuracy improvement on GSM8K, and in some cases even outperforms the 16-bit fine-tuned model while maintaining the memory footprint of the 4-bit setting.
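Summary
QLoRA effectively combines low-bit quantization with LoRA to achieve memory-friendly fine-tuning of large language models (LLMs). Recent SVD-based methods that use continuous update iterations to initialize the LoRA matrices so as to accommodate quantization errors have generally failed to improve performance consistently. Dynamic mixed precision is a natural idea for continuously improving the fine-tuning performance of quantized models, but existing methods often optimize the low-rank subspaces or the quantization components separately, without considering their synergy. We therefore propose QR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search the quantization components and the rank of the low-rank space for each layer, thereby continuously improving model performance. The method does not minimize quantization error; instead, it treats precision and rank allocation as a discrete optimization problem guided by actual downstream performance and memory usage. Compared with state-of-the-art (SOTA) quantized LoRA fine-tuning methods, our approach achieves a 4.89% accuracy improvement on GSM8K and in some cases even outperforms the 16-bit fine-tuned model while keeping the memory footprint of the 4-bit setting.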